Sentiment Analysis

On Amazon Fine Food Reviews Data

Author

Aliyu Atiku Mustapha

Published

April 9, 2024

Introduction

The objective of this analysis is to answer 5 business questions using the Amazon Fine Food Reviews Data to monitor and manage brand reputation by analyzing customer sentiment and identifying areas for improvement..

  1. What is the overall sentiment distribution of the reviews?

  2. How does the sentiment change over time?

  3. Are there any specific products that consistently receive positive or negative reviews?

  4. Do certain review lengths tend to have more positive or negative sentiment?

  5. How does the sentiment differ between different user ratings (e.g., 1-star, 5-star)?

Amazon Fine Food Reviews Data

The dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all 568,454 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review.

The dataset is available on Kaggle. The data has the following columns and their descriptions:

Column Name Column Description
Id A unique identifier for each review.
ProductId The unique identifier of the product being reviewed
UserId The unique identifier of the user who wrote the review
ProfileName The profile name of the user who wrote the review
HelpfulnessNumerator The number of users who found the review helpful
HelpfulnessDenominator The total number of users who voted on the review’s helpfulness
Score The rating given by the user (ranging from 1 to 5, where 5 is the highest)
Time The timestamp of when the review was posted
Summary A brief summary of the review’s content
Text The main body of the review, containing the detailed comments and opinions
Click to show code
# Load all required packages
# Load tidyverse package
library(tidyverse)
# Load naniar package
library(naniar)
# Load tidytext package
library(tidytext)
# Load DT package for table display
library(DT)
# Load TM package
library(tm)
# Load textclean package
library(textclean)
# Load textstem package
library(textstem)
# Load plotly package
library(plotly)
# Set default theme
theme_set(theme_minimal())

Data Loading

Click to show code
amzn_data <- read_csv("Reviews.csv")
# Display the first 500 rows of imported data
datatable(
  amzn_data[1:500,],
  filter = "top",
  caption = "The first 500 rows of loaded data.",
  options = list(pageLength = 50,
                 scrollY = "500px",
                 scrollX = TRUE))

The review dataset has 568454 rows. Lets view the data to understand its structure and composition

Data Properties

To understand the data composition, the class of each variable together with the number and proportion of missing values for each variable will provide a deeper insight to the data structure and how it could be useful for analysis.

Click to show code
# Visualize missing data
gg_miss_var(amzn_data) + theme(text = element_text(size = 14),
                               axis.text.x = element_text(size = 12),
                               axis.text.y = element_text(size = 12))

Click to show code
# Check for class of each variable
class_table <- sapply(amzn_data, class)
class_table <- data.frame(
  Variable = names(class_table),
  Class = as.character(class_table),
  stringsAsFactors = FALSE)
# Check for the proportion of missing values in full data
x <- miss_var_summary(amzn_data)
# Rename the columns 
colnames(x) <- c("Variable", "Values_missing", "Proportion_missing")
# Round the Proportion_missing column to 2 decimal points
x$Proportion_missing <- round(x$Proportion_missing, 2)
# Combine data frames on the "Variable" column
properties <- merge(class_table, x, by = "Variable")
# Display the properties of the data
datatable(
  properties,
  caption = "Table displaying the properties of the data.")

Note: Data has some missing values, only 4 ‘Profile Names’, which is negligible and can be overlooked.

Data Cleaning and Processing

Next, we’ll clean and reformat the data by creating new variables and dropping those not needed, especially those that do not provide any valuable information for sentiment classification, then prepare for analysis.

Click to show code
amzn_cleaned <- amzn_data %>%
# Step 1:Convert Time to POSIXct format and remove the day and time parts
  mutate(Time = as.POSIXct(Time, origin = "1970-01-01"),
    Time = format(Time, "%Y-%m")
    ) %>%
# Step 2: Rename Time column to Year_Month
  rename(Year_Month = Time) %>%
# Step 3: Remove columns that do not provide any valuable information for sentiment classification
 select(-Id, -ProfileName, -HelpfulnessNumerator, -HelpfulnessDenominator)
# Step 4: Convert text to lowercase
amzn_cleaned$Text <- tolower(amzn_cleaned$Text)
# Step 5: Remove punctuation and special characters
amzn_cleaned$Text <- removePunctuation(amzn_cleaned$Text)
# Step 6: Remove white spaces
amzn_cleaned$Text <- stripWhitespace(amzn_cleaned$Text)
# Step 7: Remove English "stopwords"
amzn_cleaned$Text <-
  removeWords(amzn_cleaned$Text, stopwords("en"))
# Display the first 500 rows of the cleaned and reformatted data
datatable(
   amzn_cleaned[1:500,],
   caption = "First 500 rows of cleaned data with new created variables.",
   options = list(pageLength = 50,
                 scrollY = "500px",
                 scrollX = TRUE))

Notice how common words like “this”, “is”, and punctuation are removed from the text column, to only focus on the meaningful content of the text, which helps in reducing noise and improving accuracy of the sentiment analysis.


Sentiment Analysis

The 5 Business questions will now be answered by examining the dataset.

What is the overall sentiment distribution of the reviews?

Click to show code
# Classify sentiment based on the customer 'Score'. Score above 3 are positive, scores below 3 are negative and scores that are 3 are neutral
amzn_cleaned$Sentiment <-
  ifelse(amzn_cleaned$Score > 3,"positive",
    ifelse(amzn_cleaned$Score < 3, "negative", 
           "neutral")
  )
# Count the number of reviews in each sentiment category
sentiment_counts <- amzn_cleaned %>%
  group_by(Sentiment) %>%
  summarize(Count = n())
# Plot the sentiment distribution
sent_dist <- ggplot(sentiment_counts, aes(x = Sentiment, y = Count, 
                                     fill = Sentiment)) + geom_col()  +
        scale_fill_manual(values = c(
          "negative" = "red",
          "neutral" = "blue",
          "positive" = "green"
        )) +
        labs(title = "Sentiment Distribution", x = "Sentiment type", 
             y = "Count")  +
        theme(
          text = element_text(size = 14),
          axis.text.x = element_text(size = 12),
          axis.text.y = element_text(size = 12),
          legend.position = "top",
          legend.title = element_blank(),
          legend.text = element_text(size = 14)
        )
sent_dist

The overall sentiment distribution shows that there are more positive reviews in the data set.

How does the sentiment change over time?

Click to show code
# Monthly sentiment distribution
monthly_sentiment <- amzn_cleaned %>%
            group_by(Year_Month, Sentiment) %>%
            mutate(Sentiment = as.factor(Sentiment)) %>%
            summarize(Count = n())
# Plot sentiment change over time
sent_time <- ggplotly(
              ggplot(
                monthly_sentiment,
                aes(x = Year_Month, y = Count, 
                    color = Sentiment, 
                    group = Sentiment)) +
                geom_line(size = 0.8) +
                theme(text = element_text(size = 14),
                      axis.text.x = element_blank(),
                      legend.title = element_blank(),
                      legend.background = element_rect(fill = "white",
                                                       color = "grey")) +
                scale_color_manual(values = c("negative" = "red",
                                              "neutral" = "blue",
                                              "positive" = "green")) +
                labs(title = "Sentiment Change Over Time",
                   x = "Time",
                   y = "Number of Reviews"))
# Change legend position to inside the plot
sent_time <- sent_time %>%
  layout(legend = list(x = 0.1, y = 0.9, 
                       xanchor = "left", yanchor = "top"))
sent_time

The sentiment distribution over time shows a seasonal pattern, were there is a rise in positive reviews in the 2nd and 4th quarters of the year, and a downward trend during the the first quarter. While the negative reviews show a rise during the last quarter and a decline during the first quarter.

Are there any specific products that consistently receive positive or negative reviews?

Click to show code
# Group reviews by product and calculate sentiment distribution
product_sentiment <- amzn_cleaned %>%
    group_by(ProductId, Sentiment) %>%
    summarize(Count = n())
# Filter products with only Positive sentiment
consistently_positive <- product_sentiment %>%
  filter(Sentiment == "positive" & Count > 0) %>%
  arrange(desc(Count))
# Display top 50 ProductIDs with the highest positive reviews
datatable(
  consistently_positive[1:50,],
  options = list(pageLength = 25,
                 scrollY = "500px",
                 scrollX = TRUE),
  caption = "Top 50 ProductIDs with highest positive reviews.")
Click to show code
# Filter products with only Negative sentiment
consistently_negative <- product_sentiment %>%
  filter(Sentiment == "negative" & Count > 0) %>%
  arrange(desc(Count))
# Display top 50 ProductIDs with the highest negative reviews
datatable(
  consistently_negative[1:25,],
  options = list(pageLength = 50,
                 scrollY = "500px",
                 scrollX = TRUE),
  caption = "Top 50 ProductIDs with highest negative reviews.")

The first table shows the list of products that have high number of positive reviews, while the second table shows the products with the highest number of negative reviews.

Do certain review lengths tend to have more positive or negative sentiment?

Click to show code
# Calculate the review lengths
amzn_cleaned$ReviewLength <- nchar(amzn_cleaned$Text)
# Group reviews by review length and calculate sentiment distribution
review_length_sentiment <- amzn_cleaned %>%
  group_by(ReviewLength, Sentiment) %>%
  summarize(Count = n())
# Plot sentiment vs. review length
sent_len <- ggplot(review_length_sentiment, 
                   aes(x = ReviewLength,
                       y = Count, 
                       color = Sentiment)) +
              geom_boxplot() + scale_y_log10() +
              labs(title = "Sentiment vs. Review Length",
                           x = "Review Length",
                           y = "Number of Reviews in (Log 10) Scale") +
              theme(text = element_text(size = 14),
                    axis.text.x = element_text(size = 12),
                    axis.text.y = element_text(size = 12),
                    legend.position = "top",
                    legend.title = element_blank(),
                    legend.text = element_text(size = 14)) +
              scale_color_manual(values = c("positive" = "green",
                                              "negative" = "red",
                                              "neutral" = "blue"))
sent_len

Negative reviews tend to be shorter than positive reviews, while neutral reviews tend to be long. It also emphasizes that there are more positive reviews then negative reviews in the data set.

How does the sentiment differ between different user ratings (e.g., 1-star, 5-star)?

Click to show code
# Group reviews by user rating to calculate sentiment distribution
rating_sentiment <- amzn_cleaned %>%
    group_by(Score, Sentiment) %>%
    summarize(Count = n())
# Plot sentiment by user rating
sent_rating <- ggplot(rating_sentiment, aes(x = as.factor(Score), 
                                          y = Count,
                                       fill = Sentiment)) +
            geom_bar(stat = "identity", position = "dodge") +
            theme(text = element_text(size = 14),
                  axis.text.x = element_text(size = 12),
                  legend.position = "top",
                  legend.title = element_blank(),
                  legend.text = element_text(size = 14)) +
            labs(title = "Sentiment Variation Across User Ratings",
                 x = "Customer Rating",
                 y = "Number of Reviews",
                 fill = "Sentiment") +
            scale_fill_manual(values = c("positive" = "green", 
                                          "negative" = "red", 
                                          "neutral" = "blue"))
sent_rating

Customers are generous with ratings when they are happy. Positive reviews will most likely receive a 5-star and negative reviews a 1-star.


Conclusion

1. The lack of sales data for each product, which could have been used to calculate the sentiment-to-sales ratio for each product. This ratio would provide insights into how positively the products are perceived by customers relative to their sales performance.

2. The data lacks geographical data which can be used to calculate how sentiments differ base on geographical location and how that could affect sales.

3. The data lacks categorization of products, which could be used to identify product categories with consistent high positive or negative sentiments and how that could affect sales performance.


Link to Github